Abstract

As machine understanding of language and vision progresses rapidly, we are observing
an increasing interest in holistic architectures that tightly interlink both
modalities in a joint learning and inference process. This trend has allowed the
community to progress towards more challenging and open tasks and has refueled the
hope of achieving the old AI dream of building machines that could pass a Turing
test in open domains. As we work steadily towards this goal, we realize that
quantifying performance becomes increasingly difficult. Therefore we ask
how we can precisely define such challenges and how we can evaluate different
algorithms on these open tasks. In this paper, we summarize and discuss such
challenges and try to give answers where appropriate options are available in the
literature. We exemplify some of the solutions on a recently presented dataset for a
question-answering task based on real-world indoor images that establishes a
visual Turing challenge. Finally, we argue that, despite the success of unique ground-truth
annotation, we likely have to step away from carefully curated datasets and rather
rely on ‘social consensus’ as the main driving force to create suitable benchmarks.
Providing coverage in this inherently ambiguous output space is an emerging
challenge that we face in order to make quantifiable progress in this area.

1 Introduction 

Recently we have witnessed tremendous progress in machine perception [1, 2, 3, 4, 5, 6, 7, 8] and
language understanding [9, 10, 11, 12, 13] tasks. The progress in both fields has inspired
researchers to build holistic architectures for challenging grounding [14, 15], natural language
generation from image/video [16, 17, 18], image-to-sentence alignment [19, 20, 21, 22], and recently
presented question-answering problems [23, 24, 25, 26, 27]. In this paper we argue for a Visual
Turing Test - an open domain task of question-answering based on real-world images that
resembles the famous Turing Test [28, 29] and deviates from other attempts [30, 31, 32] - and discuss
challenges together with tools to benchmark different models on such a task.

We typically measure progress in the field by quantifying the performance of different methods
against a carefully crafted set of benchmarks. Crowdsourcing in combination with machine learning
approaches has served us well to generate curated datasets with a unique ground truth at scale
[33, 34]. As the complexity and the openness of the task grow, crafting good benchmarks
also becomes more difficult. First, interpreting and evaluating the answer of a system
becomes increasingly difficult and ideally would rely on human judgement. Yet we want to have
objective metrics that we can evaluate automatically at large scale. Second, establishing an
evaluation methodology that assigns scores over a large output domain is challenging, as any system
based on ontologies will have limited coverage. Third, if our aim is to mimic human response, we
have to deal with inherent ambiguities due to human judgement that stem from issues like binding,
reference frames, and social conventions. For instance, [27] reports that for a question answering task
on real-world images even human answers are inconsistent. This is not a failure of the
annotators; rather, it points to inherent ambiguities in the task.

Competing methods are validated against true annotations, but what is the “truth” in a task where
even human answers cannot completely agree with each other? Instead of seeking a unique, “true”
answer, we suggest looking into ‘social consensus’, which takes multiple human answers into account
as different interpretations of the question. This enables us to incorporate ‘agreement’ between the
humans directly into the metric. Although the idea is not entirely new [35, 36, 37], we believe it sits
at the core of building more open and holistic challenges.

We exemplify some of our findings on the DAQUAR dataset [27] with the aim of demonstrating
different challenges that are present in the dataset. We hope that our exposition is helpful towards
building a public Visual Turing challenge and will stimulate discussion about an agreed-upon evaluation
procedure and the design of systems that can address open domain tasks.

In this paper, a holistic architecture (also holistic learner) is a machine learning architecture designed
to work on a task that fuses at least two modalities, e.g. language and vision. The external world
is the part of a task that is accessible to the holistic learner only via sensors; it can be either the human
world (the world that surrounds us) or a machine world that models some aspects of the human world.

2 Challenges 

As we strive for more holistic and open tasks such as grounding or question-answering based on
images, we need to deal with a wide range of challenges. In this section we distill and
discuss some of the most prominent ones in order to guide the further discussion.

Vision and language. Scalability: Perception and natural language understanding are crucial parts
of holistic reasoning as they ground any representation in the external world and therefore serve as
a common reference point for machines and humans. The human conceptualization divides these
percepts into different instances, categories, and spatio-temporal concepts. Architectures that
aim at mimicking or reproducing this space of human concepts need to capture the same diversity
and therefore scale up to thousands of concepts [38, 39, 40].

Concept ambiguity: As the number of categories grows, the semantic boundaries become more
fuzzy, and hence ambiguities are inherently introduced [41, 42]. For instance, sometimes we may
overlook the difference between ‘night stand’ and ‘cabinet’, or ‘armchair’ and ‘sofa’. Therefore it is
reasonable to expect holistic architectures to create alternative hypotheses of the external
world during inference. This also relates to the gradual category membership in human perception
as portrayed in the prototype theory [41, 43].

Attributes: Human concepts are not limited to object categories, but also include attributes such
as gender, colors, and states (lights can be either on or off). Often these concepts cannot be learned on
their own, but rather are contextualized by the associated noun. For example, “white” in “white elephant” is
surely different from “white” in “white snow”.

Ambiguity in reference resolution: Reliably answering questions is challenging even for humans.
The quality of an answer depends on how ambiguous and latent notions of reference frames and
intentions are understood [27, 44]. Depending on the cultural bias and the context, we may use
object-centric, observer-centric, or even world-centric frames of reference [45]. Moreover, it is
even unclear what ‘with’, ‘beneath’, and ‘over’ mean. It seems difficult, at the least, to define
them symbolically in terms of predicates. While holistic learning and inference encompassing all the
aforementioned aspects has yet to be shown, current research directions show promise [46, 47, 48] by
combining symbolic approaches [10, 11, 23, 24] with vector-based approaches [12, 19, 25]
to represent meaning.

Common sense knowledge. Some questions can be answered with high reliability using
common sense knowledge alone. For instance, “Which object on the table is used
for cutting?” already narrows the likely options significantly, and the correct answer is probably
“knife” or “scissors”. Other questions, like “Which hand of the teacher is on her chin?”, require a
mixture of vision and language. To understand the question, a holistic learner needs to first detect
a person, figure out that the person may be a teacher, infer the gender of the person, detect her
chin, understand the ‘left’ and ‘right’ sides, and finally relate ‘her’ to the ‘teacher’.

However, different parts of common sense knowledge relate to different modalities. An
‘object for cutting’ is not about seeing but about the affordance of the object, and it cannot be learnt
solely from a set of images. On the other hand, objects that often co-occur may constitute
visual common sense knowledge. For instance, we may expect to find scissors or a pen
inside a small plastic box, but never a wall or a window.

Common sense knowledge can help holistic machine learning architectures either to fulfill the task
(the question “Which object on the table is used for cutting?” can utilize this type of knowledge) or to
limit the hypothesis space and hence reduce the computational complexity of the search problem.
For instance, an architecture could be guided by its common sense knowledge to limit the space of
possible locations of the ‘scissors’ and answer “What is in front of scissors?” more effectively.

Defining a benchmark dataset and quantifying performance. We argue that question
answering based on visual input differs significantly from the grounding problem and has
unique advantages for defining a challenge dataset. Most prominently, the latter is about
finding (either with a hand-crafted set of rules or with learning-based approaches) a mapping between
linguistic fragments and the physical world [14, 15, 49], whereas the question answering task is about
an end-to-end system where we do not necessarily want to enforce any constraints or penalty on
the internal representation of the holistic learner. In this sense grounding is a latent sub-task that
the holistic learner needs to solve, but will not be evaluated on. Finally, we argue that establishing a
benchmark dataset based on a question answering task similar to a Turing test is more tractable.
Learning grounding asks for exhaustive symbolic annotations of the world, while question
answering only needs textual annotations for the aspects that the question refers to.

3 DAQUAR: Building a Dataset for Visual Turing Challenge 

DAQUAR [27] is a challenging, large dataset for a question answering task based on real-world
images. The images present real-world indoor scenes [50], while the questions are unconstrained
natural language sentences. DAQUAR’s language scope goes beyond the nouns or tuples that are
typical of recognition datasets [51, 52, 53]. Other linguistically rich datasets either do not tackle
images at all [54, 55], consider only a few in a very constrained domain [15], or are more suitable for
learning an embedding, image-sentence retrieval, or language generation [22, 56, 57, 58]. In this
section we discuss, in isolation, different challenges reflected in DAQUAR.

Vision and language. The machine world in DAQUAR is represented as a set of images and
questions about their content. DAQUAR contains 1088 different nouns in the questions, 803 in the
answers, and 1586 altogether (we use the Stanford POS Tagger [59] to extract the nouns from the
questions). If we consider only nouns in singular form in the questions, we still have 573
categories. The current state-of-the-art semantic segmentation methods on the NYU-Depth V2 dataset
[50] can discriminate between at most 37 object categories [2, 60, 61], far fewer than what is
needed. DAQUAR also contains other parts of speech, of which only colors and spatial prepositions are
grounded in [27].

Moreover, ambiguities naturally emerge due to the fine-grained categories that exist in DAQUAR. For
instance, ‘night stand’, ‘stool’ and ‘cabinet’ sometimes refer to the same thing. There is also
variation in the naming of colors among the annotations. Questions rely heavily on spatial
concepts with different frames of reference.

DAQUAR includes various challenges related to natural language understanding. Any semantic
representation needs to work with a large number of predicates (reaching about 4 million to account
for different interpretations of the external world), with questions of substantial length (10.5 words on
average with variance 5.5; the longest question has 30 words), and with possible language errors in the
questions.

Common sense knowledge. DAQUAR includes questions that can be reliably answered using
common sense knowledge. For instance, “Which object on the table is used for cutting?” already
provides strong non-visual cues for the “cutting” object. Answers to other questions, such as “What
is above the desk in front of scissors?”, can be improved if the search space is reasonably restricted.
Moreover, some annotators hypothesize missing parts of the object based on their common sense. To
sum up, we believe that common sense knowledge is an interesting avenue to explore with DAQUAR.

Question answering task. The question answering task is also about understanding the hidden
intentions of the questioner, with grounding as a sub-goal to solve. Some authors [23, 24, 27] treat the
grounding (understood here as the logical representation of the meaning of the question) as a
latent variable in the question answering task. Others [44] have modeled the pragmatic effects in the
question answering task, but such approaches have never been shown to work in less constrained
environments.

4 Quantifying the Performance of Holistic Architectures 

As the complexity and openness of the task increase, quantifying the performance of
holistic architectures becomes challenging due to several issues:

Automation: Evaluating answers on tasks as complex as question answering requires a fairly
deep understanding of natural language, the concepts involved, and the hidden intentions of the questioner.
The ideal but impractical metric would be to manually judge every single answer of every
architecture individually. Since this is infeasible, we seek an automatic approximation so that we can
evaluate different holistic architectures at scale.

Ambiguity: The complex tasks that we are interested in are inherently ambiguous. The ambiguities
stem from cultural bias, different frames of reference, and fine-grained categorization. This implies
that multiple interpretations of a question are possible, and hence many answers can be correct.

Coverage: Since there are multiple ways of expressing the same concept, the automatic performance
metric should take the equivalence class among the answers into consideration by assigning
similar scores to all members of the same class. There are attempts to alleviate this issue by defining
similarity scores [62] over lexical databases [63, 64]. These approaches, however, lack
coverage: we cannot assign a similarity between terms that are not represented in the structure.
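
The coverage limitation can be made concrete with a small sketch. The code below computes Wu-Palmer (WUP) similarity [62], defined as 2·depth(lcs) / (depth(a) + depth(b)) where lcs is the deepest common ancestor, over a hand-made toy taxonomy. The taxonomy, the function names, and the choice to return `None` for unknown terms are assumptions made purely for illustration; they do not reproduce any real lexical database.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "furniture": "entity",
    "seat": "furniture",
    "armchair": "seat",
    "sofa": "seat",
    "cabinet": "furniture",
}


def path_to_root(term: str) -> List[str]:
    """Return the chain of nodes from `term` up to the root (term first)."""
    path = []
    node: Optional[str] = term
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(term: str) -> int:
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(term))


def wup_similarity(a: str, b: str) -> Optional[float]:
    """Wu-Palmer similarity, or None if either term is not covered."""
    if a not in PARENT or b not in PARENT:
        return None  # coverage gap: the taxonomy has no entry for the term
    ancestors_b = set(path_to_root(b))
    # Walking upwards from `a`, the first shared node is the deepest one.
    lcs = next(node for node in path_to_root(a) if node in ancestors_b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


if __name__ == "__main__":
    print(wup_similarity("armchair", "sofa"))     # siblings under 'seat'
    print(wup_similarity("armchair", "cabinet"))  # share only 'furniture'
    print(wup_similarity("armchair", "beanbag"))  # 'beanbag' is not covered
```

Any term missing from the structure yields no score at all, however reasonable it may be as an answer; this is the gap that the Coverage requirement refers to.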

WUPS scores. We illustrate the aforementioned requirements with the WUPS score,
an automatic metric proposed by [27] that quantifies the performance of holistic architectures. This
metric is motivated by the development of a ‘soft’ generalization of accuracy that takes ambiguities
of different concepts into account via the set membership measure $\mu$:

$$\frac{1}{N} \sum_{i=1}^{N} \min\Big\{ \prod_{a \in A^i} \max_{t \in T^i} \mu(a, t),\; \prod_{t \in T^i} \max_{a \in A^i} \mu(a, t) \Big\} \cdot 100 \qquad (1)$$

where $N$ is the number of questions and, for each $i$-th question, $A^i$ and $T^i$ are the answers produced
by the architecture and the human respectively, represented as bags of words. The authors of [27] have
proposed using WUP similarity [62] as the membership measure $\mu$ in the WUPS score. Such a choice of
$\mu$ suffers from the aforementioned coverage problem, and the whole metric takes only one human
interpretation of the question into account.
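
As a rough illustration, the score can be sketched in Python following the definition in [27]: for each question, every word of the machine answer is matched to its best counterpart in the human answer and vice versa, the per-word memberships are multiplied, and the smaller of the two products is kept. The function names (`soft_set_score`, `wups`), the two membership functions, the similarity value in the toy table, and the convention that an empty answer scores zero are assumptions made for this sketch, not part of the original work.

```python
from math import prod
from typing import Callable, Sequence

# A membership measure maps a pair of words to a value in [0, 1].
Membership = Callable[[str, str], float]


def exact_match(a: str, t: str) -> float:
    """Hard membership: recovers plain set-equality accuracy."""
    return 1.0 if a == t else 0.0


# Hypothetical similarity table; the value is invented for illustration.
TOY_SIMILARITY = {frozenset(("armchair", "sofa")): 0.8}


def toy_membership(a: str, t: str) -> float:
    """Soft membership backed by the toy table (0.0 for unknown pairs)."""
    if a == t:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, t)), 0.0)


def soft_set_score(answer: Sequence[str], truth: Sequence[str],
                   mu: Membership) -> float:
    """Per-question score: the min of the two directed products."""
    if not answer or not truth:
        return 0.0  # assumption: an empty answer earns no credit
    forward = prod(max(mu(a, t) for t in truth) for a in answer)
    backward = prod(max(mu(a, t) for a in answer) for t in truth)
    return min(forward, backward)


def wups(answers: Sequence[Sequence[str]], truths: Sequence[Sequence[str]],
         mu: Membership) -> float:
    """Average per-question score over all questions, scaled to [0, 100]."""
    scores = [soft_set_score(a, t, mu) for a, t in zip(answers, truths)]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    machine = [["knife"], ["armchair"]]
    human = [["knife"], ["sofa"]]
    print(wups(machine, human, exact_match))
    print(wups(machine, human, toy_membership))
```

Taking the minimum of the two products penalizes both answers that contain spurious words and answers that miss words of the human answer, which is why listing many candidate words does not inflate the score.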

Future directions for defining metrics. Recent work provides several directions towards
improving scores. To deal with ambiguities that stem from different readings of the same question, we are
collecting more human answers per question and propose, based on that, two generalizations of the
WUPS score. The first, which we call the Interpretation Metric, runs Eq. 1 over many human answers and
takes the maximal score, so that the machine answer scores high if it is similar to at least one human
answer. However, with many human answers, we can also rank higher the machine answers that
are ‘socially agreeable’ by measuring whether they agree with most human answers. This can be done by
averaging over multiple human answers. We call this second extension the Consensus Metric. The
problem with coverage can potentially be alleviated with vector-based representations [12] of the
answers. Although in this case the coverage issues are less problematic, we understand the
concern that such a score depends on the training data used to build the representation. On the
other hand, due to the abundance of textual data and recent improvements in vector-based approaches
[12, 65], we consider it a valid alternative to similarities that are based on ontologies.
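
The two proposed extensions can be sketched as follows. This is one plausible reading rather than a definitive specification: the text does not fix the order of aggregation, so the sketch assumes that scores are first aggregated over the human answers of each question (maximum for the Interpretation Metric, mean for the Consensus Metric) and then averaged over questions. All function names, and the exact-match membership used in the example, are assumptions for illustration.

```python
from math import prod
from statistics import mean
from typing import Callable, Sequence

Membership = Callable[[str, str], float]
Answer = Sequence[str]


def exact_match(a: str, t: str) -> float:
    """Simplest membership measure; any soft similarity could be used instead."""
    return 1.0 if a == t else 0.0


def soft_set_score(answer: Answer, truth: Answer, mu: Membership) -> float:
    """Per-question score against a single human answer."""
    if not answer or not truth:
        return 0.0  # assumption: an empty answer earns no credit
    forward = prod(max(mu(a, t) for t in truth) for a in answer)
    backward = prod(max(mu(a, t) for a in answer) for t in truth)
    return min(forward, backward)


def interpretation_metric(answers: Sequence[Answer],
                          human_answers: Sequence[Sequence[Answer]],
                          mu: Membership) -> float:
    """Credit the machine answer if it matches at least one human reading."""
    per_question = [
        max(soft_set_score(a, t, mu) for t in humans)
        for a, humans in zip(answers, human_answers)
    ]
    return 100.0 * mean(per_question)


def consensus_metric(answers: Sequence[Answer],
                     human_answers: Sequence[Sequence[Answer]],
                     mu: Membership) -> float:
    """Credit the machine answer in proportion to human agreement."""
    per_question = [
        mean([soft_set_score(a, t, mu) for t in humans])
        for a, humans in zip(answers, human_answers)
    ]
    return 100.0 * mean(per_question)


if __name__ == "__main__":
    machine = [["sofa"]]
    # Three hypothetical annotators answering the same question.
    humans = [[["sofa"], ["armchair"], ["sofa"]]]
    print(interpretation_metric(machine, humans, exact_match))
    print(consensus_metric(machine, humans, exact_match))
```

Under this reading the Interpretation Metric never falls below the Consensus Metric for the same inputs, since a maximum is at least as large as a mean; the gap between the two indicates how contested the human answers are.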

Experimental scenarios. In many cases, success on challenging learning problems has been
accelerated by the use of external data in training, e.g. in object detection [3]. We believe that a Visual
Turing challenge should consist of a sub-task in which the use of auxiliary data is prohibited, in order to
understand how holistic learners generalize from limited and challenging data in a more established setup.
On the other hand, we should not limit ourselves to such artificial restrictions in building the next
generation of holistic learners. Therefore open sub-tasks that permit the use of other sources
in training have to be stated, including additional vision and language resources, synthetic data,
and curated questions.


5 Summary 

The goal of this contribution is to spark discussion about benchmarking holistic architectures
on complex and more open tasks. We identify particular challenges that holistic tasks should
exhibit and exemplify how they are manifested in a recent question answering challenge [27]. To
judge competing architectures and measure progress on the task, we suggest several directions
to further improve existing metrics, and discuss different experimental scenarios.
Acknowledgement: We would like to thank Michael Stark for his comments on the draft. 